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The widely-accepted intuition that the important properties of solids are determined by a few 
key variables underpins many methods in physics. Though this reductionist paradigm is applicable 
in many physical problems, its utility can be limited because the intuition for identifying the key 
variables often does not exist or is difficult to develop. Machine learning algorithms (genetic pro- 
gramming, neural networks, Bayesian methods, etc.) attempt to eliminate the a priori need for 
such intuition but often do so with increased computational burden and human time. A recently- 
developed technique in the field of signal processing, compressive sensing (CS), provides a simple, 
general, and efficient way of finding the key descriptive variables. CS is a new paradigm for model 
building — we show that its models are more physical and predict more accurately than current 
state-of-the-art approaches, and can be constructed at a fraction of the computational cost and user 
effort. 



I. INRODUCTION 

Physical intuition and experience suggest that many 
important properties of materials are primarily deter- 
mined by just a few key variables. For instance, the 
crystal structures of intermetallic compounds have been 
successfully classified into groups (so-called structure 
maps) according to the properties of the constituent 
atomsPS' The widely known Miedema rules relate al- 
loy formation energies to atomic charge densities and 
electronegativities.^ Most magnets can be described us- 
ing a Heisenberg model with only a few short-ranged ex- 
change interactions,^ and the formation energies of multi- 
component alloys can be efficiently parameterized using 
generalized Ising models (cluster expansions) with a fi- 
nite number of pair and multibody interactions.^^ In all 
these cases, enormous gains in efficiency and conceptual 
clarity are achieved by building models which express the 
quantity of interest (typically, total energy) in a simple, 
easy-to-evaluate functional form. These models can then 
be used to perform realistic simulations at finite temper- 
atures, on large systems, and/or over long time scales, 
significantly extending the reach of current state-of-the- 
art quantum mechanics based methods. 

The conventional approach to model building starts by 
selecting a small, physically motivated basis set which de- 
scribes the configuration of the system. The target prop- 
erties are then expressed in terms of these basis functions 
and the unknown coefficients are determined by perform- 
ing least-squares fits to the calculated or experimentally 
measured data. While conceptually simple, this method 
is often difficult to use in practice. First, the number of 
unknown coefficients has to be smaller than the number 
of data points, which precludes the use of very large basis 
sets. Second, least-squares fitting is susceptible to noise, 
and there is often the possibility of "over- fitting" — the 
model is trained to reproduce the fitting data, but per- 



forms poorly in a predictive capacity. Finally, finding 
the the optimal finite basis set is an iVP-hard problem, 
i.e., the solution time increases faster than polynomial 
with the number of possible basis functions. To keep the 
number of coefficients smaller than the amount of data, 
one must choose, based on physical intuition, which basis 
functions to keep. This physical intuition in many cases 
may be unavailable and/or difficult to develop; hence 
there is no clear path to achieve systematic improvement. 
Recent years have seen numerous attempts to use ma- 
chine learning algorithms (genetic programming, neural 
networks, Bayesian methods, eta) to decrease the role 
of intuition in model-buildingPBfUl 

We show that a recently developed technique in the 
field of signal processing, compressive sensing (CS)p^ 
provides a simple, general, and efficient approach to 
model-building. 1 - Instead of attempting to develop phys- 
ical intuition for which coefficients will be most relevant, 
the CS framework allows the inclusion of essentially all 
possible basis functions. Using very large basis sets elim- 
inates the need to use physical intuition to construct 
smaller ones. Furthermore, CS is computationally effi- 
cient for very large problems, robust even for very noisy 
data, and its models predict more accurately than cur- 
rent state-of-the-art approaches. 

II. COMPRESSIVE SENSING: AN 
ILLUSTRATION 

Before demonstrating the power of compressive sens- 
ing for building physical models, we first illustrate the 
concept itself with a simple time series. Discussion of 
compressive sensing requires the definition of £ p norms: 

HIp= (EN P ) - W 
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FIG. 1. (a) A sparse signal 
(blue line) like that of Eq. [5] uni- 
form samples of the signal at the 
Nyquist frequency (red dots), and 
a few random samples (black cir- 
cles). The signal is composed of 
only 3 non-zero frequencies, (b) 
Exact recovery of the frequency 
components of the signal using 
compressive sensing. 



of which the l\ (taxicab or Manhattan distance) and li 
(Euclidean; subscript 2 often omitted) norms are special 
cases. The number of non-zero elements of u is often 
(improperly) referred to as the £q "norm" even though it 
is not a norm in a strict mathematical sense. 

In the signal processing community, compressive sens- 
ing is used to recover sparse signals exactly with far fewer 
samples than required by standard spectral techniques, 
such as the well-known Fourier and Laplace transforms. 
Consider a signal like that shown in Fig. [lja) which has 
the functional form: 



l\ norm of the solution, subject to the constraint given 
by Eq. Q above: 
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where most of the coefficients, u n , are zero (i.e., the sig- 
nal is sparse). The Fourier transform is mathematically 
equivalent to solving the matrix equation 



Au = /, 



(3) 



where the matrix A is formed by the values of the Fourier 
basis functions at the sampling times t m , i.e., it con- 
sists of rows of n terms of the form A mn — e l27m *>™ ; 
and f m = f(t m ) is the sampled signal. The solution 
vector u contains the relative amounts of the different 
Fourier components, as shown in Fig. [ljb) . Capturing 
all relevant frequency components of the signal using 
Fourier transform techniques requires the signal to be 
sampled regularly and at a frequency at least as high as 
the Nyquist frequency [shown as red points in Fig. [TJa)], 
a severe restriction stemming from the requirement that 
the linear system Eq. Q should not be underdetermined. 

However, the main idea of compressive sensing is 
that, when the signal is sparse, one should be able to re- 
cover the exact signal with a number of measurements 
that is proportional to the number of nonzero compo- 
nents, i.e., with far fewer samples than given by the 
Nyquist frequency. Conceptually, this could be done by 
searching for a solution that reproduces the measured 
time signal exactly and has the minimum number of non- 
zero Fourier components. Unfortunately, this formula- 
tion results in a discrete optimization problem, which 
cannot be solved in polynomial time. Compressive sens- 
ing recasts the problem as a simple minimization of the 



min{||it||i : Au = /}, 



(4) 



where ||u||i= \ u i\ ^ s ^ ne ^i-norm defined in Eq. (JTj). In 
other words, one seeks to minimize the sum of the com- 
ponents of the solution vector u subject to the condition 
that the measured signal is reproduced exactly; this con- 
stitutes the so-called basis pursuit problem. Eq. (4]) is 
a convex optimization problem which can be solved effi- 
ciently (see Sec. IV I. We note here that optimization of 
the common sum-of-squares (£2) norm of u would result 
in a dense solution which may deviate considerably from 
the original signal.^ 

As a simple illustration, the exact decomposition of 
an example function, shown in Fig. [T] was possible via 
compressive sensing with only 5 random samples (black 
dots in figure [T]) of the signal, instead of the 24 equally- 
spaced samples (red dots in figure[l]) needed for a discrete 
Fourier transform. Quite generally, a mathematical the- 
orem proven by Candes, Romberg, and Tao^ guarantees 
that, with an overwhelming probability, any sparse sig- 
nal with S nonzero components can be recovered from 
M ~ SdogA^ random measurements, where N is the to- 
tal number of sensing basis functions. This very power- 
ful result is the mathematical foundation of compressive 
sensing. 

Another practically important feature of compressive 
sensing is the ability to tolerate noise in the input data 
and to deal with signals that are only approximately 
sparse, i.e., are dominated by a few large terms, but also 
contain a large number of smaller contributions; this is 
the case in almost all physics applications. It has been 
proven that, if the sensing matrix A obeys the so-called 
restricted isometry property (RIP), an accurate recon- 
struction of the signal from highly under-sampled mea- 
surements can be achieved al so in the presence of both 
random and systematic noise liS12Sl The RIP criterion is 
automatically satisfied if the measurements are chosen 
randomly, (see Sec. IV D for a detailed discussion) 

When applying compressive sensing to model build- 
ing, two tasks must be accomplished: (i) a basis must 
be chosen, and (ii) the coefficients associated with each 
basis function must be determined. Mathematically, the 
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problem is analogous to the simple Fourier example con- 
sidered above, with the sensing matrix A being deter- 
mined by the values of the basis functions at the cho- 
sen measurement points. Below we illustrate the use of 
compressive sensing on two cluster expansion (CE) mod- 
els of configurational energetics!^ (i) Ag-Pt alloys on a 
face-centered cubic (fee) lattice, and (ii) protein folding 
energies in the so-called zinc finger motif. CE is chosen 
as an example because it is conceptually simple, mathe- 
matically rigorous, and widely used in the materials com- 
munity to calculate temperature-composition phase dia- 
grams. Furthermore, CE is a stringent test case for com- 
pressive sensing because a significant amount of effort has 
been expended developing advanced model building tech- 
niques, which have been impleme nted in sophisticated 
general-purpose computer codesP ^ 17 * 21 ^ 2H 

III. CLUSTER EXPANSION 

A. Energy Model 

Since a formal mathematical description of CE can be 
found in the literature, here we only restate its main 
features and refer the reader to Refs. [7HS1 for detailed 
explanations. The CE method uses a complete set of 
discrete basis functions, defined over clusters of lattice 
sites, which describe the occupation of each site and thus 
the entire atomic configuration on the crystal. The total 
energy is given by 

S( ( r) = J Bo+^ri / (a)J / , (5) 
/ 

where / represents symmetrically distinct clusters of lat- 
tice sites (points, pairs, triplets, etc.), a denotes the 
atomic configuration, usually expressed by a collection 
of pseudo-spin variables {Si} describing the type of 
atom at each lattice site, and the cluster correlations 
II f (a) are formed as symmetrized averages of products 
of these pseudo-spin variables. The key quantities in 
this approach are Jt, the effective cluster interactions 
(ECI's): Given the ECI's, the energy of any atomic con- 
figuration on the lattice can be calculated rapidly from 
Eq. ([5]) . Physical intuition based on the concept of "near- 
sightedness" of screened interatomic interactions suggests 
that only clusters within a limited range and involving a 
limited number of sites will have significant ECI's. The 
goal then is to determine which of the clusters /, out of 
the myriads of possible choices, contribute significantly 
to the total energy of the system and to calculate the 
values of these coefficients. 

Currently, the most popular approaches are based on 
the so-called structure inversion method (SIM) P where 
a limited number of quantum-mechanics-based total en- 
ergy calculations are used to determine E(a) on the left- 
hand side of Eq. ^ . The cluster interactions J/ are trun- 
cated according to some recipe and their values are deter- 
mined by least-squares fitting to the training set energies 



E(a). The accuracy of the resulting CE depends crucially 
on the chosen truncation method. Including too few in- 
teractions leads to poor predictive power because impor- 
tant interactions are not accounted for ( "under-fitting" , 
while choosing too many parameters J/ results in spuri- 
ous interactions and an associated decrease in predictive 
accuracy ("over-fitting"). Use of the least-squares fitting 
necessarily requires that the number of structures must 
exceed the number of candidate ECIs, which is the CE 
analogue of the Nyquist frequency in signal processing. 

In modern practice, the trial ECI's are chosen by scan- 
ning over many possible sets of clusters while attempt- 
ing to minimize the predictive error. Ideally, the predic- 
tive error should be calculated as the root mean square 
(RMS) deviation between the density functional theory 
(DFT) and CE-predicted energies over a separate "hold- 
out" set of structures that are not used in fitting. This 
approach would require tens or hundreds of additional 
DFT calculations and is therefore seldom used in prac- 
tice. Leave-one-out cross-validation (LOOCV) or fc-fold 
cross-validation scores are commonly used as proxies for 
the predictive error since they do not require the con- 
struction of a separate hold-out setPil 

Starting from an initial set of ECI's (e.g., empty, point, 
and nearest-neighbor pair clusters), a typical procedure 
for improving the model attempts to add and/or sub- 
stitute clusters into the current set, keeping changes if 
the predictive error is found to decrease. The procedure 
is terminated when none of the attempted changes pro- 
duce an improvement in the predictive accuracy. Unfor- 
tunately, there is no guarantee that this process will truly 
result in the optimal set of ECI's because it is practi- 
cally impossible to solve the AP-hard discrete optimiza- 
tion (DO) problem, especially if the number of candi- 
date ECI's is large, such as required for very accurate 
CE's or in situations of low symmetry (e.g., near de- 
fects, surfaces, nano-particles) . As a result, with DO, one 
may miss important clusters or add clusters that should 
not have been included in the model, which may result 
non-physical ECI values. Additionally, adding/removing 
clusters one-by-one is computationally expensive, requir- 
ing days to finish when considering very large pools of 
candidate clusters. 

Genetic algorithms have been used with some success, 
but they also require large amounts of time to complete, 
especially for large cluster pools and fitting sets, and em- 
ploy a host of tunable paramters ! 23 * 27 ! Design of efficient, 
numerically robust and physically accurate methods for 
selecting the physically significant ECIs remains a chal- 
lenging problem. 

Other researchers, in an attempt to avoid predictive 
errors associated with incomplete discrete optimization, 
have sought to devise direct minimization methods that 
automatically select ECIs only if they are required to re- 
produce the energies of the training set. The first such 
approach was proposed by Laks, Wei, and Zunger for pair 
interactions,^ who added a distance- weighted £2 norm of 
the pair interactions to the objective function. However, 
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this approach usually results in dense sets of long-ranged 
pair ECI's and, more importantly, is difficult to extend 
to multibody interactionsP^HH] Recently, a method based 
on Bayesian statistics was introduced to automatically 
estimate ECI's and shown to outperform several com- 
mon methods in low-symmetry situations.^ However, it 
makes use of physical intuition to construct informative 
prior distributions, which are required for estimating the 
ECI values. It is desirable to develop methods that avoid 
the use of intuition since heuristic rules, derived from ex- 
perience with a few specific systems, may not be univer- 
sally valid. 



B. Compressive sensing cluster expansion (CSCE) 

Here, we show that compressive sensing can be used 
to select the important ECI's and determine their val- 
ues in one shot. The applicability of compressive sensing 
to CE is based on the mathematical theorem of Candes, 
Romberg, and TaopSl which guarantees that sparse ECI's 
can be recovered from a limited number of DFT forma- 
tion energies given certain easy-to-satisfy properties of 
the matrix n in Eq. (J5J). Adopting the common assump- 
tion that the "true" physical ECI's are approximately 
sparse, this theorem guarantees that a good approxima- 
tion will be found even in cases when the data has both 
random and systematic noise, e.g., due to numerical er- 
rors in the DFT calculations or due to interactions be- 



yond the chosen energy resolution, see Sec. IV E 



There are two possible formulations for compressive 
sensing cluster expansion (CSCE), both of which enforce 
the requirement that the cluster expansion should be as 
sparse as possible, while resulting in a certain level of 
accuracy for the training set. In the first approach, one 
may determine the optimal set of ECI's from 



J = arg min • 



|J||l : ||£-nj|| < e} , 



(6) 



where t\ norm of J's is used as a proxy for the number 
of non zero E CI's. Solving the so-called LASSO problem 
Eq. (JfSj 33 3 ^ offers a mathematically strict way of con- 
structing a minimal cluster set that reproduces the train- 
ing set with a given accuracy. Of course, over- (under-) 
fitting is still an issue if e is chosen too small (large) , but 
it is physically reasonable that, given the physical prop- 
erties of the system and the size of the training set, an 
optimal e always exists. Following common practice, op- 
timal e could be found cither by minimizing the LOOCV 
score or the predictive RMS error over a hold-out set. 

Since the inequality constraint is inconvenient to en- 
force during calculations,^ here we follow common prac- 
tice in signal processing and use an unconstrained ap- 
proach which minimizes the sum of an l\ norm of the 
ECI's and a least-squares sum of the fitting errors: 



J = arg min /i| | J||i + -| 



where fj, is a parameter that controls the accuracy of the 
fit versus the sparseness of the solution: higher values 
of fi will result in sparser solutions and larger fitting er- 
rors (under-fitting) , while very small /i values will lead to 
dense solutions and degraded predictive accuracy (over- 
fitting). It will be shown below in Sec. IV E that the 
optimal value of \x is proportional to the level of noise 
(random and systematic) in the calculated formation en- 
ergies. Just like e in Eq. ([6|, an optimal /i to avoid over- 
or under-fitting can be chosen either by minimizing the 
LOOCV score or by minimizing the rms prediction er- 
ror for a separate hold-out set; it is shown below that 
both approaches result in very similar values of optimal 
/j. Furthermore, in Sec. [V] we demonstrate that CSCE 
is not particularly sensitive to the precise value of fi and 
show that there is usually a range of /i's that give ECI's 
of similar predictive accuracy. 

The main advantage of CSCE, Eqs. ^ & 0, over 
current CE methods is that the AP-hard discrete opti- 
mization of the truncated ECI set is replaced by convex 
optimization problems for which exact solutions may be 
found in polynomial time. Furthermore, the minimiza- 
tion of the l\ norm of the solution also serves to decrease 
the magnitude of the ECI's, leading to "smoother" ECI's, 
increased numerical stability with respect to the noise in 
the training data, and eventually more accurate predic- 
tions. In addition, the CSCE is simple to implement and 
use, which will facilitate its widespread adoption in solid 
state physics and other fields where configurational en- 
ergetics play a role. In Sec. [V] below, we illustrate the 
superior performance of CSCE using examples from bulk 
alloys (Ag-Pt) and biology (protein folding energetics). 



IV. PRACTICAL ASPECTS OF ^-BASED 
OPTIMIZATION 

In what follows, we review methods for solving the 
unconstrained minimization problem given by Eq. d7|), 
which we rewrite as: 



min/i||u||i + |||Au-/]| 



(8) 



Eq. ([8]) is referred to as the basis pursuit denoising prob- 
lem. It has a tunable parameter, /i, which controls the 
sparseness of the solution: smaller (larger) values of fi 
produce less (more) sparse solutions. 



A. Fixed-point continuation 

The fixed-point continuation (FPC) method of Hale, 
Yin, and Zhang 3 -^ is an iterative algorithm that starts 
from fir = and attempts to improve the objective func- 
tion by following the gradient of the £2 term: 



E-UJl 



(7) 



f = A T (Au k 



f) 



,k+l 



shrink [ut 



(9) 
(10) 
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where k = 0,1,2,... is the iteration number and the 
shrinkage operator is defined as 

shrink (y, a) := sign(y) max (\y\ — a, 0) . (11) 

In other words, shrinkage decreases the absolute magni- 
tude of y by a and sets y to zero if \y\ < a. The iterations 
are stopped when the €oo norm, or maximum component 
value, of the gradient drops below the shrinkage thresh- 
old, 



- 1 < K 



(12) 



and the change in the solution vector is sufficiently small, 



-tk+l 



< s u . 



(13) 



The sensing matrix should be normalized in such a way 
that the largest eigenvalue a a of A T A is less than or 
equal to 1; this is easily accomplished b y di viding both 
A and / by -J a a- The step size r in Eq. ( 10 ) is given by 



M 

t = min(1.999, -1.665— + 2.665), (14) 

where M and N are the number of equations and the 
number of expansion coefficients, respectively. 



B. Bregman iteration 

While FPC algorithm is generally applicable to any 
problem of type Eq. ^ and is guaranteed to converge, 
in practice it has a serious shortcoming: very small values 
of // are needed to recover the exact solution to the basis 
pursuit problem without noise, Eq. Q, which cause an 
associated increase in the number of FPC iterations. To 
alleviate the need to use small /z's, Yin et a/P^ proposed 
an efficient iterative denoising algorithm for finding the 
solution to Eq. ([8]), which has the additional benefit of 
yielding the exact solution to the basis pursuit problem 
Eq. @ for zero noise. This so-called Bregman iteration 
involves the following two-step cycle: 

/* +1 = /+(/*- AS*), (15) 
u k+1 = axgmin/iMi + UAu - / ?£+1 || 2 , (16) 

starting from f 3 = and u° = 0. A key feature of the 
algorithm is that the residual after iteration k is added 
back to the residual vector f k+1 for the next iteration, 
resulting in efficient denoising and rapid convergence!^ 
Each minimization in Eq. (16) can be performed using 
the fixed-point continuation (FPC) method proposed by 
Hale, Yin, and ZhangP The main advantages of the 
Bregman iteration are faster convergence and the abil- 
ity to use \i values that are several orders of magnitude 
larger than those required for direct application of the 
FPC method. 



C. Split Bregman iteration 

For very large problems (i.e., large sensing matrices 
A), the FPC optimization steps in the Bregman itera- 
tive method progress very slowly. The problem becomes 
severe when the condition number computed from the 
nonzero eigenvalues of A T A becomes large. Indeed, FPC 
is essentially a steepest descent method combined with 
an ii shrinkage step, and the number of required steepest 
descent iterations increases linearly with the ratio of the 
largest-to-smallest eigenvalues of A T Ap3 An improved 
Bregman algorithm, which eliminates the hard-to-solve 
mixed £\ and £2 minimization problem in Eq. (16), was 
proposed by Goldstein and Osher!2£l It carries the name 
of "split Bregman" iteration because it splits off the l\ 
norm of the solution from the objective function and re- 
places it with a new variable d, which is designed to con- 
verge towards the l\ term, limfc_ ) . 00 ((? c — /i-u fc ) = 0. A new 
least-squares £2 term is added to the objective function 
to ensure that d = fj,u in the limit: 



u = argmin||d||i + MAu- f\\ 2 + ^\\d - /j.u\\ 



(17) 



A key advantage of this formulation is that the mini- 
mization involving the quadratic form ||Au — /|| 2 does 
not contain t\ terms and can be performed efficiently 
using standard convex optimization techniques, such as 
Gauss-Seidel or conjugate gradients (CG) Awhile the £\ 
minimization with respect to d at a fixed u contains an £2 
term that is diagonal in the components of d and can be 
solved easily (see below). The full split Bregman iterative 
algorithm proceeds as follows: 

u k+1 = argmin|||Au- f\\ 2 + f ||d* - fiU - 6 fc || 2 ,(18) 



(? s+1 = argmm||tfj|i + fiu k+1 - b k 

b k+1 = b k +^u k+1 -d k +\ 



'k 1 1 2 



(19) 
(20) 



starting from cfr — 0, 6° = 0, and if = 0. We use the 
conjugate gradient method to perform the £2 minimiza- 
tion in Eq. (18). The second step, Eq. (19), separates 
into individual vector components and can be solved ex- 
plicitly using shrinkage as 



shrink (fiu k+1 + b k ,l/\) 



(21) 



The final step of the split Bregman cycle, Eq. (20), adds 
back the residual deficit in the £\ term, in complete anal- 



ogy with the Bregman iteration Eq. (15). The results do 



not depend on the value of the parameter A, although an 
unsuitable choice will lead to very slow or failed conver- 
gence. We find that in practice an optimal A can easily 
be found from a few trial runs at a fixed value of /1, 
and then kept fixed for any [i. Just like FPC, the split 
Bregman iteration provides an exact solution to the basis 
pursuit denoising problem Eq. (|8| , but in contrast to the 
Bregman approach of Sec. IV B| small values of fi may 
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be needed to solve the noiseless basis pursuit problem 
Eq. Q. In practice, we find that the convergence rate of 
the split Bregman method is almost always faster than 
those of the Bregman or FPC algorithms, and greatly so 
for large, ill-conditioned sensing matrices. 



D. Choice of structures for CSCE 

An important practical question regards the best strat- 
egy for choosing structures a to include in the training 
set. Mathematical theorems from compressive sensing 
provide a definite answer to this question. The key idea 
is the notion of coherence between the measurement and 
representation basis. The representation basis $ = {<j>j} 
is used to express the signal as a sparse series expansion 
(e.g., plane waves form the representation basis for the 
Fourier series), while the measurement basis \& = {V>fc} 
contains all possible measurements. For the Fourier ex- 
ample in Sec. [TTJ the measurement basis is given by delta 
functions, i.e., signal values at certain points in time. As- 
suming that both ipj and <pk are normalized and orthog- 
onal, the coherence is defined as the maximum overlap 
between themP*! 



The coherence is given by the maximum scalar product 
between the two, which is 



N max \{4>j,tpk}\ ■ 

3,k 



(22) 



In the Fourier example of Sec. [TTJ the scalar products are 
all \{4>ji^k)\ = N~5 f which corresponds to the lowest 
possible coherence, v = 1. In contrast, the highest pos- 
sible value v = y/N would be obtained by directly mea- 
suring the amplitudes of the individual sinusoidal com- 
ponents of the signal, i.e., if plane waves were chosen as 
the measurement basis functions. Coherence is key in 
determining the number of measurements required to re- 
cover a given sparse signal with S nonzero components: 
the higher the coherence, the higher the required num- 
ber of measurements. More quantitatively, the proba- 
bility of correct signal recovery from M measurements 
exceeds 1 — 5 if the number of measurements satisfies 
M > Cis 2 (§,y)S\og(N/S), where C i s a co nstant and S 
is the number of nonzero components) 40 * 41 ^ a similar re- 
sult holds for compressive sensing in the presence of noise. 
This expression shows that the worst possible strategy 
for recovering sparse signals is to choose the same mea- 
surement basis as the one used in sparse representation 
(y{^^) w vN), since this would require a number of 
measurements equal to the number of unknown coeffi- 
cients, N. 

In cluster expansion, the representation basis are 
formed by symmetry-distinct cluster types and the mea- 
surements are represented by structures a. The corre- 
sponding representation basis functions are Kroenecker 
deltas, 4> 9 {f) — Sf g , where / and g are cluster num- 
bers. The measurements are represented by symmetry- 
inequivalent structures a, and the corresponding basis 
functions are given by normalized rows of the cluster 

correlation matrix, i.e., ip a {f) — Rf(< T )/\Jj2f'^f'( a ) 2 - 



*) = %/iVmax l n /( g )l _ 



(23) 



Because random matrices with independent identically 
distributed (i.i.d.) entries are incoherent with almost any 
representation basis, they occupy a special place in com- 
pressive sensing. If the possible measurements are de- 
signed by selecting N uniformly distributed random vec- 
tors on the unit sphere, followed by subsequent orthog- 
onalization, the coherence between $ and \& is on the 
order of \J2 log A.QS This suggests the following simple 
strategy for selecting structures for CSCE: 

• Generate M uniformly distributed random vectors 
ipaif) on the unit sphere (a = 1, . . . , M) 

• Orthogonalize ip a {f) 

• Match each if> a (f) onto a real structure a with nor- 



malized correlations Hf(a)/ y X}/' ^f'( a ) 2 approx- 
imating ipa(f) as closely as possible 

The last step can be conveniently performed by enumer- 
ating all possible ordered structures up to a certain size 
of the unit cell using the methods of Refs. Pf2l and H31 and 
then choosing the best matches from this list. We stress 
that the somewhat counterintuitive strategy of selecting 
random structures follows from the general mathematical 
properties of l\ -based compressive sensing and represents 
the best possible method for choosing structure sets for 
CSCE. 

Parenthetically we note that selecting structures at 
random makes for a remarkably simple approach to gen- 
erating input data. The "structure selection" problem, 
that is, deciding which structure to use to train the 
model, has been a vexing problem in the cluster expan- 
sion community since cluster expansions first began to 
be trained with first-principles data. At first, structures 
that were easy to calculate (few atoms per unit cell) were 
selected. In later years, more sophisticated approaches 
came to be usecF™^, but a simple, easy-to-implement 
solution has remained elusive. Compressive sensing not 
only solves the "cluster selection" problem (because it 
makes unbiased selections from a huge set of clusters) 
but also overcomes the structure selection problem be- 
cause it dictates that the best strategy is simply to select 
structures at random. 



E. Effect of noise and its relation to optimal /i 

The lone adjustable parameter, fi, should be chosen 
to achieve the optimal balance between the sparseness of 
the ECI's and the RMS fitting error for the training set. 
The effect of /j, on the calculated ECI's is most transpar- 
ently seen by analyzing the FPC equations ([9| and (10 1, 
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which show that /i controls the energy cutoff for the gra- 
dient of the £2 norm of the residuals: components of g 
with absolute values \g*\ < /i will be set to zero by the 
shrinkage operator and therefore will be excluded from 
the model. In what follows, we show that the optimal 
value for /1 is proportional to the level of noise (random 
and systematic) in the training data. 

We first consider the relation between the normalized 
sensing matrix A in Eq. ^ and the CSCE correlation 
matrix II: they are related by A = Il/^/ajj, where a^j 
is the largest eigenvalue of fifi T . The corresponding re- 
lation for the measurement vectors is / = E / ^/ojjj. The 
distributions of the extremal eigenvalues for ideal random 
matrices are known from the theory of principal compo- 
nent analysis.^ However, it is not immediately clear that 
the eigenvalue distributions found for i.i.d. random ma- 
trices will be directly applicable to the CSCE correlation 
matrices fi because the correlation values for real struc- 
tures are neither independent nor identically distributed, 
and hence the entries of ft are only approximately i.i.d. 
We have numerically calculated the distribution of the 
largest eigenvalue of fifi T using subsets of 1 < M < 500 
fcc-based ordered structures with 12 or fewer atoms in 
the unit cell.^ We considered N — 986 correlations (up 
to six-body terms) and averaged the calculated eigenval- 
ues over 1000 subsets randomly drawn from the above 
list of 10850 structures. We find that, for a fixed N, the 
average value of ajj increases linearly with the number 
of structures M. Therefore, A oc li/^fM . 

Random noise: Here we demonstrate that CSCE is 
not only stable with respect to noise in the input data, 
but that it can also filter out the effects of noise on the 
calculated ECI's. We assume that the DFT formation 
energies E(o~) contain random noise which is represented 
by a vector ff e of length M and i.i.d random components 
with variance ejL^. The contribution of fj c to the FPC 
gradient in Eq. (I9J) is given by 



size of the training set as 



6 9f 



1 



M 



M ^ 



(24) 



where the factor 1/M comes from the fact that both the 
sensing matrix A and the measurement vector / are re- 
lated to the correlation matrix fi and input energies E by 



a normalization factor 1/ \ 



If the structures are cho- 



sen rando mly according to the prescription outlined in 
Sec. IVD then fi/(er) € [—1,1] are approximately i.i.d. 
Hence, the individual terms under the summation sign 
in Eq. ( 24 ) will be randomly distributed with a mean of 
zero and a variance proportional to ej? and . To deduce the 
behavior of Sgf in the limit of large M, one can apply the 
central limit theorem (CLT) of classical statistics, which 
states that the average of M random terms is normally 
distributed with a variance that is given by the variance 
of the individual terms divided by M, i.e., the variance of 
Sgf is proportional to eJ? and /M. It then follows from the 
properties of the normal distribution that the average £\ 
norm of the noise term in the gradient decreases with the 



ll<%lli « 



(25) 



This relation demonstrates an important noise-tolerance 
aspect of CSCE, which guarantees that the true physi- 
cal ECI's will be recovered even if the training data sets 
contains uncorrelated random noise of arbitrary magni- 
tude, provided that the number of data points is suffi- 
ciently large. The practical significance of this feature 
cannot be overstated: not only is CSCE stable with re- 
spect to random noise, but an absolute numerical accu- 
racy in the DFT energies is no t ev en needed to recover 
the correct ECI'sPS Equation (25) also offers guidance 
for choosing n to smooth the effect of random noise: as 
long as [i ~ 1 ||i, the contribution of noise to the gra- 



dient will be zeroed out in the shrinkage step [Eq. ( 10 )] 



and will not affect the calculated ECI's. In practice, how- 
ever, the optimal value of /x is difficult to determine using 
Eq. ( 25 ) because the level of noise in the DFT formation 



energies is not known a priori, and approaches based on 
optimizing the predictive error are more practical. 

Systematic noise: We next consider the effect of sys- 
tematic noise due to errors in the ECI's, which we de- 
note by 5Jf. These errors contribute a term 5E(o) — 
^jfi/(cr)5J/ to the residual, and the corresponding er- 
ror in the FPC gradient is given by 

%cx- J2(^r)SJr, (26) 
/' 

where we have introduced a correlation matrix for cluster 
correlations n calculated over the training set: 



1 M - 

(n/S/') = T Z E n /( <7 ) n /'( cr )- 

(7 = 1 



(27) 



This matrix is of fundamental importance for CSCE be- 
cause it describes how the value of one ECI is affected 
by errors in the other ECI's, or the degree of cross- 
contamination between systematic ECI errors. Mini- 
mum sensitivity to cross-contamination is achieved when 
(fi/fi//) is diagonal, but the latter case is impossible to 
realize in practice due to the fact that there are rather 
pronounced correlations between the cluster averages in 
real structures. In the best case scenario, the correla- 
tion matrix (TLfTLfi) will be approximately diagonal if 
the training set structures are chosen randomly accord- 
ing to the algorithm proposed in Sec. |IVD| Indeed, if 
the average cluster correlations n / (er) are approximately 
i.i.d., the off-diagonal elements of the correlation matrix 
(fi/fi/') tend to zero with increasing M, while the diag- 
onal elements remain O(l): 



(H/fl/') 




for / - /' 



for / * f 



(28) 



Hence, in the limit of large M, CSCE based on a ran- 
domly chosen training set cleanly separates the contribu- 
tions of the systematic ECI errors to the gradient, i.e., 




FIG. 2. || Jexact — Jfit ||i (solid) and || Jfit||o (dashed) vs logio/i for the short-ranged pair model with M = 200 (a) and M = 400 
(b). Random uniform noise of ~ 10%(blue circles), 20% ( green squares), and 50% (red "x"s) of the noiseless energies was 
added to the fitting structures, (c) || J cxa ct — Jfit||i vs the number of fitting structures and the noise level. Each point represents 
an average over ~100 different subsets of M structures. 



the ECI error for cluster / only affects the component / 
of the gradient, enabling accurate recovery of the correct 
solution. This is an important feature for any physics 
model-building approach because it guarantees the sta- 
bility of the solution with respect to the interactions that 
are not represented within the chosen basis set. Further- 
more, these considerations offer another insight into the 
physical meaning of the parameter fi: it can be used to 
filter out the cross-contamination due to effects of sys- 
tematic noise if chosen as 



IIWIIc 



M 



(29) 



where ||<5J||oo is the magnitude of the largest error in the 
cluster interactions. Since the diagonal contribution to 
the gradient remains constant with increasing M, succes- 
sively smaller ECI's can be extracted by increasing the 
size of the training set M and simultaneously decreas- 



ing the value of [i according to Eq. ( 29 ) . Unfortunately, 



the practical value of this expression is limited because 
the ECI errors are not known, and approaches based on 
minimizing the prediction errors or CV scores should be 
used instead. 

The preceding analysis shows that \x can be interpreted 
as a parameter controlling the filtering of the noise in the 
calculated energies, including both random noise due to 
numerical errors in the DFT formation enthalpies and 
systematic noise due to cluster interactions that are not 
recoverable using the given structure set. Expressing the 
total noise level as a sum of random and systematic con- 
tributions, e 2 = e 2 an( j + e systi tne effect of both is expected 
to decrease the inverse of the size of the training set, and 
the optimal value of is expected to vary as lyM- We 
note here that the Bregman and split Bregman itera- 
tions contain additional noise-filtering steps [Eqs. ( 15 ) 
and ( 20 )] which add back the residual to the residual of 
the next iteration. As a result, the optimal value of ji 
will in general vary between the different l\ optimization 
approaches, even though the s olution s and the predictive 
errors are practically the samePiESl 



V. APPLICATIONS 
A. Short-ranged pair model with noise 

We first work with an ad-hoc cluster expansion ex- 
ample where we choose a set of sparse coefficients and 
then use them to compute the energies of various crystal 
structures for use as input to CSCE. The advantage of 
this approach is that knowing the exact solution a priori 
allows us to easily determine the accuracy of the solu- 
tion found by CSCE and determine how numerical noise 
influences the performance of the algorithm. While this 
example is certainly not representative of any real al- 
loy system, it clearly illustrates some key features of the 
method, particularly how CS performs with noisy data. 

Using the uncleP^I framework the following clusters on 
an fee lattice were enumerated: 141 pairs, 293 triplets, 
241 four-bodies, 87 five-bodies, and 222 six-bodies (986 
clusters in total, including the onsite and empty clusters) . 
The coefficients of the three shortest nearest-neighbor 
pairs were chosen as 10, 4, and 1, respectively; all other 
coefficients were set to zero. Uniformly distributed ran- 
dom noise equal to ~ 10%, 20%, and 50% of the noise- 
less energies was added to the computed energies E(a). 
We emphasize that these noise levels significantly ex- 
ceed typical numerical errors in the calculated forma- 
tion enthalpies from state-of-the-art quantum mechanics 
codesP^ 

The values of each of the 986 basis functions were com- 
puted for all structures in the training set, thus forming 
the sensing matrix, A. The rows of the sensing matrix, 
A, which each represent a training set structure, were 
constructed by drawing randomly from a uniform distri- 
bution on [—1,1]. For real systems, such as Ag-Pt in 
the next section, these rows should be mapped onto real 
crystallographic configurations as described in Sec. |IVD| 
However since the quality of the fit for the short-ranged 
pair case was found to be unaffected by this mapping, 
either favorably or adversely, we chose to simply use the 
random vectors themselves in order to simplify compu- 
tations. 
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Figures [2^ a) and[2]jb) illustrate the performance of CS 
by showing two quantities: 1) the £i-norm of the differ- 
ence between the exact and fitted coefficients (||J exa ct — 
Jet||i)> and 2) the number of non-zero coefficients (£q- 
norm of the solution, || Jet||o)- We varied /i to investigate 
it's optimal values for a given noise level. Each data point 
in Fig. [2] was obtained by averaging over approximately 
100 different sets, each of size M = 200 or 400. 

The curves in Fig. [2] exhibit a series of plateaus, each 
one indicating a region over which the extracted solu- 
tion remains practically unchanged. Notice, for exam- 
ple, the plateau located between log 10 fi = —0.75 and 
log 10 p, = —0.4 in the || Jfit||o vs. \i curve for M = 200 and 
the lowest noise content (circle markers). This plateau 
indicates that CSCE has extracted three non-zero coef- 
ficients. Furthermore, the value of || J eX act — Jfitlli drops 
close to zero in this range, indicating that CSCE has 
found essentially the exact answer. Using values of \i 
below the optimal range results in sharp increases in 
both the number of nonzero coefficients and in the er- 
ror || J oxact — Jet ||i) indicating overfitting. 

Conversely, \i values above the optimal range result in 
fewer non-zero coefficients and an incremental increase 
in || Jexact — Jfit||i) probably indicating undcrfitting. As a 
function of increasing fi, one first obtains a plateau where 
the CS reproduces the two largest expansion coefficients 
(10 and 4), followed by another plateau where only the 
largest coefficient is reproduced. This example illustrates 
the important point that CS is largely insensitive to the 
choice of fi — the ability to recover the correct solution 
does not depend on the exact value of fi, as long as it lies 
within an optimal, but broad, range. 

Upon increasing the noise in the fitting data at a fixed 
data set size [compare the curves marked by circles and 
squares in Fig. [2|a)], the plateaus in || Jet||o vs A* become 
narrower until the highest plateau, corresponding to full 
recovery of the true solution, disappears completely ( "x" 
markers in Fig. [2]). At the same time, the minimum in 
the error || J exa ct — Jat||i vs. fi is increasing incrementally. 
This displays the robustness and stability of CS — even at 
a very high noise level we are able to recover the majority 
of the signal content. 

The shift towards higher values of optimal fi upon in- 
creasing noise level in Fig. [2] is consistent with the physi- 
cal interpretation of \x as the threshold for noise filtering 



given in Sec. IV E We also note that an increase in the 
number of structures M tends to slightly lower the op- 
timal /i, which can be attributed to a fuller recovery of 
the correct solution and an associated decrease in the 
systematic noise. 

Figure [2jc) displays || J cxac t - Jfit||i, averaged over ap- 
proximately 100 random subsets, as a function of M, the 
number of fitting structures, and the noise level. Here we 
see the same plateau structure found in Fig. ^&), with 
the lower (blue) plateau indicating essentially an exact 
fit. This plot demonstrates that, for all noise levels con- 
sidered (up to as high as 50% of the noiseless energies!), 
there remains a training set size for which the exact so- 
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CSCE predictive errors for Ag-Pt 
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FIG. 3. Root-mean-square errors for the prediction set 
(black line with empty squares) and the leave-one out cross- 
validation score (LOOCV, solid blue line) as functions of the 
parameter fj,. LOOCV has been averaged over 10 randomly 
drawn sets of 100 (400) structures, and the error bars were 
calculated from the variance in the predicted LOOCV scores 
over these sets. Predictive errors for the hold-out set and the 
fitting errors for the training set were averaged over 500 dif- 
ferent sets of 100 (400) structures; the corresponding error 
bars are smaller than the size of the symbols. 



lution will be recovered. 



B. Actual alloy example: Ag-Pt 

Having explained the basic properties of CSCE for 
a model system, we now test its performance on real 
DFT data for binary Ag-Pt alloys on a face-centered 
cubic (fee) lattice. Ag-Pt was chosen due to a report 
of unusual ordering tendencieiP^ which are non-trivial 
to reproduce with current state-of-the-art CE meth- 
ods. The energies of more than 1100 Ag-Pt fee-based 
crystal structures^ were calculated from the density- 
functional theory (DFT) using the vasp software \&iM 
We used projector-augmented-wave (PAW) potentials^ 
and the generalized gradient approximation (GGA) to 
the exchange-correlation functional proposed by Perdew, 
Burke and Ernzerhof!^ To reduce random numerical er- 
rors, equivalent fc-point meshes were used for Brillioun 
zone integration!® Optimal choices of the unit cells, us- 
ing a Minkowski reduction algorithm, were adopted to 
accelerate the convergence of the calculations.^ The ef- 
fect of spin-orbit coupling was not included in our calcu- 
lations because it's effect was shown to be a simple tilt 
of the calculated energies, as explained in Ref. EH 

Out of a total of approximately 1100 structure energies 
calculated for this system, 250 were chosen at random to 
be held out of the fitting process and used for prediction. 
This "holdout" set remained unchanged for all fitting sets 
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chosen. Of the remaining 850 data points available for 
fitting, subsets of up to N = 400 were chosen to be used 
as CSCE training data. 

We start by illustrating the performance of two differ- 
ent methods for selecting the optimal value of /i. First, we 
varied /i and calculated the standard LOOCV score over 
10 different randomly drawn subsets of M structures; the 
results are shown by blue curves in Fig. [3] It is seen that 
the LOOCV scores reach their minima at \i « 4 and 2 
meV/atom for M = 100 and 400, respectively, which we 
interpret as the optimum /i's providing maximum predic- 
tive power. Second, we calculated the average prediction 
errors for all structures left out of the fitting set, which 
are represented by the black dotted lines in Fig. [3j We 
see that the RMS errors for the prediction set largely fol- 
low the same behavior as the LOOCV scores, reaching 
minima at nearly identical /i values. 

As expected, fitting errors for the training set (not 
shown here) decrease monotonically with decreasing fi 
and are significantly smaller than either the LOOCV 
scores or prediction errors for the hold-out set. The lev- 
eling off in both the prediction errors and the LOOCV 
score at small values of \i can be explained by noting that 
CSCE fits the training set perfectly and further decrease 
of fi does not bring about noticeable changes in the cal- 
culated ECI's. We note that this behavior is different 
from the short-ranged pair model in the previous sec- 
tion, where decreasing (i below the optimal range caused 
a rapid deterioration in the accuracy of the calculated 
ECI's. We attribute this difference to the lower level of 
noise in the Ag-Pt case, so that the range of /i's that 
leads to acceptable ECI's is much wider than at the 20- 
50% noise level for the short-ranged pair model. 

To compare the performance of CSCE with other es- 
tablished methods, a discrete optimization (DO) scheme 
as impl ement ed in the state-of-the-art ATAT software 
package} 21 * 22 ^ was used. Note that the ATAT program is 
capable of employing advanced algorithms beyond mini- 
mization of the LOOCV score to ensure that the ground 
state line is reproduced correctly. In order to make a 
straightforward comparison between CSCE and DO, we 
only used the LOOCV-based DO functionality of ATAT. 
Since the DO method for N — 986 clusters on a train- 
ing set of a few hundred structures takes several days to 
complete, averages were taken over only 10 training sets 
of size M (except for M = 400 when we used 42 dif- 
ferent training sets to perform statistical analysis of the 
calculated ECI's). In order to simulate building a compli- 
cated unknown model, we deliberately avoided applying 
physical intuition (e.g., picking short-range interactions) 
and simply performed the optimizations with minimal re- 
strictions. The maximum number of reported ECI's was 
capped to M/4 for ATAT-based DO. For CSCE, we used 
a fixed fi = 8 meV/atom and computed solutions for 500 
randomly chosen training sets of M structures. 

Figure [4] shows a box and whisker plot of the RMS er- 
rors over the prediction set for CS solutions and the mean 
RMS values for the DO solutions (box-and-whiskers were 
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FIG. 4. Results from compressive sensing and leave-one-out 
cross-validation for the fcc-based, Ag-Pt alloy system. The 
solid line gives the root-mean-square (RMS) errors for predic- 
tions made on a constant holdout set for CS(box and whisker) 
and leave-one-out cross-validation (squares). The dashed lines 
give the ^i-norm of the solution vector for both methods. 

not used for DO solutions due to the small number of 
DO fits). Each box and whisker represents RMS values 
for approximately 500 different fits. We see that CSCE 
achieves an RMS error value much lower (2.8 meV/atom) 
than LOOCV-based DO (6.8 meV/atom). Furthermore, 
Fig. [4] shows that the l\ norm of the solution increases 
almost linearly for the DO fit, while it levels off for the 
CSCE fit, indicating that the latter is converging towards 
a stable solution, while the former keeps adding large 
ECI's, a behavior suggestive of over-fitting. 



C. Statistical analysis of Ag-Pt ECI's 

Because CSCE is fast, thousands of fits for many dif- 
ferent training sets can be computed in a few minutes. 
The results of all these fits can be analyzed statistically 
to determine which coefficients are consistently identi- 
fied as contributors and to eliminate artifacts due to a 
particular choice of the training set. This functionality, 
the ability to gather enough data in a reasonable amount 
of time to perform statistical analyses, is a significant 
advantage of CSCE over (slower) DO methods that can 
be used to gain insight into the probability distributions 
for the cluster interactions. These distributions can be 
used to quantify the uncertainty in the CSCE predictions 
for physical properties that go beyond a simple LOOCV 
score or an RMS prediction error. For instance, one can 
draw ECI's from the calculated distributions and gener- 
ate ground state convex hulls with statistical error bars 
on each structure, quantifying the uncertainty in the pre- 
dicted T = K phase diagrams. 

CSCE fits for 500 different fitting set choices were com- 
puted for Ag-Pt. Most of the resulting distributions had 
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FIG. 5. Comparison of the interaction coefficients found using the DO method implemented in ATAT software and compressive 
sensing. The upper pane shows a comparison of two typical fits from CS and ATAT. The lower pane shows the coefficients 
that were found to be statistically relevant from both methods. The x-axis is the cluster radius, which is defined as the 
average distance from the center of mass of all cluster vertices. (Blue dots were placed on the x-axis even for clusters not 
found to be relevant to help the reader know the ordinal number of the relevant clusters.) Physical intuition suggests that 
shorter-radius, fewer-vertex clusters are the most important contributors in alloy energetics. Pair interaction coefficients found 
by both methods are similar. As the number of vertices increases, CS finds coefficients in harmony with physical intuition, 
while DO finds spurious, long-ranged three- and four-body interactions. CS solutions also demonstrate a convergence to one 
specific solution as the size of the fitting set increases, (note: Triplets and quadruplets are shown on a scale from -20 to 20 
meV, different from the scale used for the pairs.) 



only one sharp peak at zero, indicating that, indepen- 
dently of the choice of the training set, they were almost 
never selected by CSCE and therefore should be set to 
zero. Several ECI's exhibited a unimodal distribution 
with nonzero mean, which were interpreted as strongly 
significant nonzero interactions. Finally, a fraction of the 
ECI's showed bi-modal distributions with two peaks of 
comparable weight and one of the peaks centered at zero 
energy. Since the latter ECI's were selected by CSCE 
with an approximately 50% probability, they belong to 
the class of "marginal" interactions which were counted 
as significant only if their distribution mean was greater 
than one standard deviation. To make a fair compari- 



son between CSCE and the DO method implemented in 
the ATAT program, the same statistical criteria for de- 
termining relevant coefficients was used for the DO fits, 
even though data for only 42 fits were available. 

Figure [5] gives a comparison of the CS-determined co- 
efficients and those found by DO. The upper pane com- 
pares a typical DO fit with a typical CSCE fit, while the 
lower pane gives a comparison of statistically relevant 
ECI's from both methods. The CSCE-derived ECI's ap- 
pear to evolve towards one specific solution as the size 
of the fitting set increases, indicating convergence of the 
solution. Notice also that the magnitudes of the CSCE 
coefficients decrease as the spatial extent of the cluster 
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Predicted Ag-Pt formation energies 
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FIG. 6. Predicted CSCE formation energies obtained using 
the ECI's shown in Fig. [5] error bars are standard deviation 
due to different random choices of <= 400 structure subsets. 
Black solid line denotes the convex hull calculated from the 
average energies; only Ag, CayGe-type Ag7Pt (barely, with 
a depth of less than 1 meV/atom), Lli AgPt, and Pt are 
predicted to be T = K ground states. 
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FIG. 7. Predictive performance of CS for protein energetics 
in the zinc- finger structure (shown in the inset). 



D. Protein folding application 



increases and as the number of cluster vertices increases 
(note that triplets and quadruplets are shown on a scale 
from -20 to 20 meV, as opposed to -50 to 50 meV for 
pairs). This is in harmony with long-standing claims in 
the CE community, and it confirms that a stable solu- 
tion has been found. DO-determined clusters follow this 
pattern for pair clusters only. At higher vertex num- 
bers, a typical DO fit finds non-physical, spurious coeffi- 
cients for three- and four- body interactions. The set of 
statistically-relevant DO coefficients appear to be lacking 
several important interactions, specifically short-ranged 
three- and four-body interactions. This indicates that: 
(i) current DO methods are much too slow to be able 
to gather enough statistics to do a meaningful statistical 
analysis, and/or (ii) current DO methods are very sen- 
sitive to the choice of the training set and fall short in 
their ability to identify physically relevant interactions 
without user guidance. 

Figure [6] shows the results of a ground state search 
performed by using the statistically significant M = 400 
coefficients to predict the energies of all fcc-based super- 
structures up to 12 atoms. Error bars were calculated 
from randomly drawn sets of M = 400 structures. The 
ground state line in this figure is consistent with first- 
principles data for this system, which finds the same 
ground states as in Fig. |6j with a few degenerate struc- 
tures lying on the convex hull between c = 0.4 and 0.5. 

This example shows that, in comparison with tradi- 
tional cluster selection methods, CS is not only simpler 
and faster (less than a minute on a single CPU for CS 
versus days for LOOCV at M = 400), but also produces 
more physical solutions that result in a significant im- 
provement in physical accuracy. 



We now turn to a technically much more challenging 
case — that of protein design in biology. Modeling the 
protein folding energies in the zinc-finger motif represents 
a techn ically difficult test case with novel applications in 
biology .L521S0I Q ne f ^ e problem j n protein design is 
to find the sequence of amino acids (AAs) which stabi- 
lizes a particular 3D structure, or folding. Physics-based 
energy functionals are considered to be some of the most- 
promising methods in protein design since they link the 
stability of the the folded 3D structure to the total free 
energy, accurately accounting for electrostatics, van der 
Waals interactions, and solvation effects. However, their 
use is problematic due to the astronomical number of 
possible AA se quences for even very short proteins. It 
was showr! 59 60 that the CE model can be generalized to 
describe protein energetics, allowing very fast direct eval- 
uation of the protein energy as a function of its sequence. 

Here, we use the data from Ref. [SHI for the so-called 
zinc-finger protein fold and closely follow exactly the 
same computational procedures as employed in that 
study. The fitting is done using a basis of approximately 
76,000 clusters and energies of 60,000 AA sequences; a 
separate set of 4,000 AA sequence energies is used to test 
the predictive power of the CE model. The very large 
size of the problem presents a severe test to the conven- 
tional LOOCV-based model building approach, requir- 
ing running times of several weeks on parallel computers 
with user-supervised partial optimizationP^We chose the 
highly efficient split Bregman iteration^ for solving the 
basis pursuit denoising problem in Eq.(j8j), which allows 
us to perform a full optimization in approximately 30 
minutes on a single 2.4 GHz Intel Xeon E5620 proces- 
sor. Figure [7] shows that for the physically important 
negative-energy configurations, we are able to achieve an 
RMS predictive error of 2.1 kcal/mol with 3,100 model 
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parameters, significantly better than the RMS error of 
2.7 kcal/mol with approximately 6,000 parameters ob- 
tained using the LOOCV method in Ref. HH1 Since the 
predictive errors are Gaussian-distributed with a mean 
of zero, the statistical uncertainty in the predictive er- 
ror due to the finite size of the prediction set (> 1000 
negative-energy structures) can be calculated using stan- 
dard statistical formulas for the ^-distribution; they are 
found to be less than 1% of the calculated RMSE. These 
results show that the computational efficiency, concep- 
tual simplicity and physical accuracy of the ^-based min- 
imization shows promise for future applications in protein 
design. 



VI. CONCLUSION 

In conclusion, compressive sensing can be straightfor- 
wardly adopted to build physical models that are dom- 
inated by a relatively small number of contributions 
drawn from a much larger underlying set of basis func- 
tions. Compressive sensing is applicable to any "sparse" 
basis-expansion problem, a broad class of problems in 
physics, chemistry and materials science. Compressive 
sensing allows the identification of relevant parameters 
from a large pool of candidates using a small number of 
experiments or calculations — a real paradigm shift from 
traditional techniques. Furthermore, many other scien- 
tific problems that do not appear to be a basis pursuit 
problem may be recast as one, in which case CS could 
efficiently provide accurate and robust solutions with rel- 
atively little user input. With the huge amount of experi- 



mental and computational data in physical sciences, com- 
pressive sensing techniques represent a promising avenue 
for model building on many fronts including structure 
maps, empirical potential models, tight binding meth- 
ods, and cluster expansions for configurational energies, 
thermodynamics and kinetic Monte Carlo. 

In the arena of cluster expansion, compressive sens- 
ing provides a simple solution to two challenges: "cluster 
selection" and "structure selection." Cluster selection 
is effectively solved because compressive sensing can se- 
lect clusters efficiently from a very large set (thousands 
or tens of thousands). Essentially, it allows the user to 
specify a cluster set so large that it encompasses every 
physically-conceivable interaction. The second challenge, 
structure selection, is overcome by the fact that com- 
pressive sensing requires that input structures simply be 
chosen randomly from configuration space. 
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